Worksheet 1: Working with One Dimensional Data

This worksheet covers the concepts from the first half of Module 1 - Exploratory Data Analysis in One Dimension.

There are many ways to accomplish the tasks presented here; however, the exercises should be relatively simple if you use the techniques covered in class.


Import the Libraries

For this exercise, we will be using pandas, NumPy, and Matplotlib:


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Exercise 1: Splitting and Filtering a Series

In this exercise, you are given a list of email addresses called emails. Your goal is to find the email accounts from domains that end in .edu. To accomplish this, you will need to convert the list to a Series and then:

  1. Filter the Series to remove the emails that do not end in .edu.
  2. Extract the accounts (the part of each address before the @).

If you get stuck, refer to the documentation for Pandas string manipulation (http://pandas.pydata.org/pandas-docs/stable/text.html) or the slides. Note that there are various functions to accomplish this task.


In [3]:
emails = ['alawrence0@prlog.org',
'blynch1@businessweek.com',
'mdixon2@cmu.edu',
'rvasquez3@1688.com',
'astone4@creativecommons.org',
'mcarter5@chicagotribune.com',
'dcole6@vinaora.com',
'kpeterson7@topsy.com',
'ewebb8@cnet.com',
'jtaylor9@google.ru',
'ecarra@buzzfeed.com',
'jjonesb@arizona.edu',
'jbowmanc@disqus.com',
'eduardo_sanchezd@npr.org',
'emooree@prweb.com',
'eberryf@brandeis.edu',
'sgardnerh@wikipedia.org',
'balvarezi@delicious.com',
'blewisj@privacy.gov.au']

In [5]:
#Your code here...
emails_records = pd.Series(emails)
# Keep addresses whose domain ends in .edu, then take the part before the '@'
emails_records[emails_records.str.endswith('.edu')].str.split('@').str[0]


Out[5]:
2     mdixon2
11    jjonesb
15    eberryf
dtype: object
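
Whether the pattern is treated as a literal suffix or as a regular expression matters for this kind of filter. The following is a minimal sketch, assuming a small hypothetical sample Series (the address prof@education.com is invented for illustration), that contrasts a literal suffix test with an unanchored regular expression search:

```python
import pandas as pd

# Hypothetical sample; the second address contains "edu" but is not a .edu domain.
sample = pd.Series(['mdixon2@cmu.edu', 'prof@education.com', 'jjonesb@arizona.edu'])

# Step 1: keep only addresses whose domain ends in .edu (literal suffix match).
edu_only = sample[sample.str.endswith('.edu')]

# Step 2: take the part before the '@' as the account name.
accounts = edu_only.str.split('@').str[0]

# str.contains treats its pattern as a regular expression by default, so the
# '.' matches any character (here the '@') and 'prof@education.com' slips through.
loose = sample[sample.str.contains('.edu')]

print(accounts)
print(loose)
```

With str.contains, passing regex=False or escaping the dot would avoid the wildcard match, but the pattern could still appear anywhere in the string, so a suffix test is the closer fit for "ends in .edu".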

Exercise 2: Applying a Function

In this exercise, you are given a list of weights in pounds and a function to convert the measures into kilograms. Apply the conversion function to the original Series to convert the weights into kilograms.


In [7]:
weights = [31.09, 46.48, 24.0, 39.99, 19.33, 39.61, 40.91, 52.24, 30.77, 17.23, 34.87]
# 1 pound = 0.45359237 kilograms
pd.Series(weights).apply(lambda x: x * 0.45359237)


Out[7]:
0     14.102187
1     21.082973
2     10.886217
3     18.139159
4      8.767941
5     17.966794
6     18.556464
7     23.695665
8     13.957037
9      7.815397
10    15.816766
dtype: float64
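
The same conversion can be written with a named function instead of a lambda. This is a minimal sketch, assuming a hypothetical helper called lbs_to_kg and a shortened sample of the weights; it also shows that plain arithmetic on the Series gives the same values without apply:

```python
import pandas as pd

LBS_TO_KG = 0.45359237  # kilograms per pound

def lbs_to_kg(pounds):
    """Convert a weight in pounds to kilograms (hypothetical helper)."""
    return pounds * LBS_TO_KG

weights_lbs = pd.Series([31.09, 46.48, 24.0])

# Element-wise application of the named function
kg_applied = weights_lbs.apply(lbs_to_kg)

# Equivalent vectorized arithmetic on the whole Series
kg_vectorized = weights_lbs * LBS_TO_KG

print(kg_applied)
print(kg_vectorized)
```

For simple arithmetic, the vectorized form is usually shorter and faster; apply is more useful when the conversion involves logic that cannot be expressed as a single Series operation.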

Exercise 3: Putting it all together

You are given a Series of IP addresses, and the goal is to limit this data to private IP addresses. Python has an ipaddress module, which provides the capability to create, manipulate, and operate on IPv4 and IPv6 addresses and networks. Complete documentation is available here: https://docs.python.org/3/library/ipaddress.html.

Here are some examples of how you might use this module:

import ipaddress
myIP = ipaddress.ip_address( '192.168.0.1' )
myNetwork = ipaddress.ip_network( '192.168.0.0/28' )

#Check membership in network
if myIP in myNetwork:  #This works
    print( "Yay!" )

#Loop through CIDR blocks
for ip in myNetwork:
    print( ip )

192.168.0.0
192.168.0.1
...
192.168.0.13
192.168.0.14
192.168.0.15

#Testing to see if an IP is private
if myIP.is_private:
    print( "This IP is private" )
else:
    print( "Routable IP" )

  1. First, write a function that takes an IP address and returns True if the IP is private and False if it is public. HINT: use the ipaddress module.
  2. Next, use this function to create a Series of True/False values in the same sequence as your original Series.
  3. Finally, use this Boolean Series to filter the original Series so that it contains only private IP addresses.

In [16]:
import ipaddress

df_hosts = pd.Series([
    '192.168.1.2', '10.10.10.2', '172.143.23.34',
    '34.34.35.34', '172.15.0.1', '172.17.0.1'])
# Keep only the addresses that the ipaddress module flags as private
df_hosts[df_hosts.apply(lambda x: ipaddress.ip_address(x).is_private)]


Out[16]:
0    192.168.1.2
1     10.10.10.2
5     172.17.0.1
dtype: object
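
The three steps in the exercise can also be written out one at a time. This is a minimal sketch, assuming a hypothetical function name is_private_ip and a shortened sample Series; treating unparseable strings as not private is an assumption made here, not something the exercise specifies:

```python
import ipaddress
import pandas as pd

def is_private_ip(address):
    """Return True if address is a private IP, False otherwise.

    Strings that cannot be parsed as an IP address are treated as
    not private (an assumption made for this sketch).
    """
    try:
        return ipaddress.ip_address(address).is_private
    except ValueError:
        return False

hosts = pd.Series(['192.168.1.2', '34.34.35.34', '172.17.0.1', 'not-an-ip'])

# Step 2: Boolean mask in the same order as the original Series
mask = hosts.apply(is_private_ip)

# Step 3: keep only the entries where the mask is True
private_hosts = hosts[mask]

print(private_hosts)
```

Keeping the mask as its own variable makes it easy to inspect, and inverting it with ~mask would select the public addresses instead.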